Setting Up For Today

Welcome to the Data Visualization Workshop! Data Viz plays a really crucial role in the more general data analysis workflow:

Adequately teaching how to do data summarization would be an entirely different workshop. You’ll still be able to work along here, but if you’re unfamiliar with R and the tidyverse, you might have to resort to just copy-pasting code as we go along.

But, for materials for learning more about the basics of R, I’d refer you to my 2017 LSA course materials:

  1. Intro to R
  2. Data and DataFrames
  3. Split-Apply-Combine

The process of learning R

These are some of the core areas I figure are necessary to getting good at statistical modelling in R:

  1. Using R (and RStudio) well
  2. Feeling comfortable and fluid reorganizing and summarizing data
  3. Visualizing Data
  4. Deciding before you model what you want to compare to what
  5. How to translate your analysis goals into R code
  6. Understanding a little bit about statistics
  7. When something goes wrong, being able to accurately attribute your difficulty to one of the above topics

These are all skills you can achieve through practice, experience, and occasional guidance from someone more skilled than you. It is exactly like acquiring any other skill or craft. At first it will be confusing, you’ll make some mistakes, and it won’t look so good. I like comparing it to knitting.

The first hat I ever knit:

A more recent hat I knit:

The way I improved my knitting is exactly the same as how you can improve your R programming ability:

  • I knit a lot (almost every day).
  • I memorized a bunch of stuff.
  • I remembered where to look up the stuff I didn’t have memorized.
  • My knitting became more “idiomatic” (i.e. I started knitting like how other knitters knit).
  • I learned how to identify and fix mistakes without undoing my entire project.
  • I developed good workspace hygiene & organization.
  • As I got the basics down, I started researching and incorporating fussy little details into my work.

R, RStudio and R Notebooks

We’re going to be using R, RStudio, and R Notebooks in this course, and it’s a little important to keep straight what these three things are:

R

R is a programming language that runs on your computer. At its barest bones, it looks like this:

You can type text into the prompt there, and if you’ve successfully memorized the right R commands, it’ll do some things.

RStudio

RStudio is like an Instagram filter over the top of R, to make your experience of using R better. It visually organizes some important components of using R into panes, and offers code completion suggestions. For example, if you remember there’s something called a “Wilcoxon test”, but you don’t remember what the function in R is, you can start typing in Wilc, and this will happen:

RStudio’s autocompletion is really useful for a lot of other things, like reminding you what the column names are in your data frame, what the names of all the arguments to a function are, etc.

But perhaps the most valuable component in RStudio these days is its authoring tools, like R Notebooks.

R Notebooks

R Notebooks allow you to document your code in plain text, insert R Code chunks, and view the results of the R code all in one place, then compile it into a nice looking notebook.

Discussion

I’m going to recommend (for now at least) that you run all of your code through an R Notebook. It is possible to just type things into the R console, but that’s kind of like dictating a paper into thin air. Once you’ve spoken the words, they disappear and can be hard to recover.

My earlier advice would have been to write all of your code in an R script file, but that also separates the code from its results, which can be hard for beginners to keep track of.


Installing R Packages

R comes with a lot of functionality installed, but one way that R is extensible is through users’ ability to contribute new code & data through its package management system. We’re going to be using a number of these packages in the course, especially since a few of them have fundamentally changed the way R programming works in the past 3 years. There’s also a course R package I’ve created to easily distribute sample datasets.

Here’s a basic diagram of how R packages work:

Installing Packages

install.packages()

Most R packages are distributed through CRAN (Comprehensive R Archive Network). When you run the function install.packages("x"), R checks whether the package "x" exists on CRAN, and installs it on your computer if it does. You may be asked to choose a “CRAN mirror” the first time you run install.packages(). This is because there are many copies of CRAN distributed across the internet. I’d recommend choosing the first option, called 0-Cloud.

install_github()

As a package developer, getting a package onto CRAN can be a bit of a pain, so some packages (and development versions of many) are also available on GitHub, which can be easily installed with devtools::install_github("username/package").

Installing packages is different from loading packages

Installing a package is different from loading packages. Installing a package only downloads and configures the code on your computer. In order to use the contents of a package, you need to load it into your R session with library().

  • You only need to run install.packages() once to install a package, or to update a package.
  • You need to run library() at the start of every new R session in order to use the functionality from that package.

For example, ggplot() is a function from the package ggplot2. I have already installed ggplot2 on my computer, but if I try to use ggplot() before loading the package with library(), I’ll get the error that the function was not found.

foo <- ggplot()
Error in ggplot() : could not find function "ggplot"
library("ggplot2")
foo <- ggplot()

~2 Minute Activity Let’s actually load up the initial packages we’re going to use today:

library(tidyverse)
── Attaching packages ──────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.0.0     ✔ purrr   0.2.4
✔ tibble  1.4.2     ✔ dplyr   0.7.5
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(lsa2017)
library(broom)

First Principles of Plotting

Why Plot

Data visualization is an essential component of data analysis. In fact, I believe it is necessary to learn how to plot your data before you learn how to do statistical modelling. If you haven’t made a lot of graphs of your data, and have only looked at averages, correlations, and linear model results, then you don’t really understand your data.

There’s a classic illustration of this called Anscombe’s quartet, which when plotted looks like four very distinct patterns.

But if you fit linear models to them, they have nearly identical statistical properties.

fit_lm <- function(df){
  lm(y ~ x, data = df)
}
anscomb_models <- tidy_anscombe %>%
                    group_by(series) %>%
                    nest() %>%
                    mutate(model = map(data, fit_lm),
                           model_param_df = map(model, tidy),
                           model_glance = map(model, glance))
anscomb_models %>%
  unnest(model_param_df) %>%
  arrange(term)
anscomb_models %>%
  unnest(model_glance)

It has been even more humorously illustrated recently that you can produce data sets of almost any arbitrary shape that have nearly identical statistical properties.

Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing

In fact, more and more people are starting to use statistical techniques where the only plausible way to report the results of the model is with a graphical display.

Thinking about Plotting

It’s important to think of your figures as a report of your data. Try to take as much care in producing your plots as you do your writing, or reporting of your statistics. They are as important as (or for some readers, more important than) anything else in your paper.

“Accuracy”

When making a plot, you should strive for accuracy in:

  • Accurately representing the properties of numbers.
  • Accurately representing the nature of your data.

Take this very simple data set:

group value
A 2
B 5

For our purposes, these numbers have three properties.

  1. Order: 2 < 5, or A < B
  2. Magnitude: 5 = 2.5 \(\times\) 2, or B = 2.5 \(\times\) A
  3. Contextual Magnitude: If A and B are bars, and these are measures of the cost of a pint, then A must be a real dive (and a good deal), and B must be a little bit better, but still not too fancy. If A and B are people, and these are their number of legs, then A has an unsurprising number of legs and B has a surprising number of legs.

Here is an example of an inaccurate plot:

It successfully captures the order of A and B, but fails to capture the correct magnitude of the difference. The magnitude of the difference is thrown off because the y-axis doesn’t start at 0. In this plot, the B line is 7\(\times\) longer than the A line, but the actual magnitude of the difference is 2.5\(\times\). This produces a “lie factor” of \(\frac{7}{2.5} = 2.8\).

This isn’t just a hypothetical problem either. For example, British electoral mailers are notorious for inaccurately portraying the magnitude of differences.

Both academic researchers and the producers of these political mailers may counter by saying

But the axes were labelled accurately!

If readers would understand your data better if they ignored the graphical elements of your plot, then your plot is net-negative to accurate communication, and I can only assume that accuracy was never your primary goal anyway.

Think about how plots are read

The way in which we read the numbers off of plots also matters a lot to constructing a figure. For example, the bar plot above maps the number of MPs to the length of the bars. Another graphical feature we could use is to map the number of MPs to the position of points.

Or, we could map it to the area of circles:

Or, to the angle of a pie chart slice:

However, not all visual dimensions are as easily decoded visually as others. Research has found that people are more accurate in their perception of numeric differences when graphs use length and position, rather than area or angle.

Colo(u)r

We also need to think a lot about how we use color in figures. In the political graphs, there are clear and iconic colors associated with each political party, which I would recommend using in a case like that. Choosing a different color palette makes the graph more confusing.

There are many other kinds of conventionalized color-meaning mapping that you should usually stick to (like cold = blue, hot = red). But you should also be careful to avoid using conventionalized color schemes that either reinforce harmful stereotypes (like pinking and bluing gender) or could be otherwise offensive to cultural sensitivities.

You should also be careful to avoid accidentally conveying something you don’t intend to with your color choices:

There’s no sensible order to the voicing contexts that /ay/ appears in above, but one seems to be implied through the use of a gradient color scheme. The voiced context is also specially highlighted by being mapped to a red hue while the rest are mapped to blue hues. Rather, we’d probably want a color palette like one of the two below, where the colors are distinct, unordered, and perceptually uniform.

ggplot2 basic concepts

Why learn ggplot2?

There are a few different graphics packages out there to use, including base R plots and lattice. ggplot2, however, is much more plugged into the tidyverse workflow, which seems to be the trending direction of R Programming. It’s also extensible, meaning people are producing a lot of really cool and really useful add ons!

Layers, Aesthetics, Geometries and Statistics

We’re going to start by working with the /ay/ dataset from the lsa2017 package.

The /ay/ dataset

This /ay/ data set contains over 80,000 tokens of the /ay/ vowel which were automatically extracted from 326 sociolinguistic interviews in Philadelphia. There are two allophones of /ay/ encoded in the data column plt_vclass:

  • ay0 = pre-voiceless /ay/
  • ay = all other /ay/

Across the 20th century, the vowel quality of pre-voiceless /ay/ underwent a large change from something like [ɑɪ] to [ʌɪ]. This was a phonetically gradual change, so we’ll be plotting it according to the primary phonetic correlate of this change, normalized F1.

To kick things off, we’ll just estimate the average normalized F1 for each speaker and allophone:

ay_means <- ay %>%
            mutate(dob = year-age)%>%
            group_by(idstring, dob, plt_vclass)%>%
            summarise(F1_n = mean(F1_n))
head(ay_means)

We’re going to take this data and build up to making this plot:

Layers

Hopefully, you’ll start looking at figures like this one the way many of us look at the image below.

Those of us familiar with this kind of media know that the picture of the library is not what was originally captured by my phone. Rather, there are multiple layers of effects, filters and text on top of the base image, which produce the final image. And in fact, some of these layers are crucially ordered. For example, the text would look different if it was added to the image first, and then the filters, instead of vice versa.

So too with the ggplot2 plot above. These plots are constructed out of layers. Every component of the graph, from the underlying data it’s plotting, to the coordinate system it’s plotted on, to the statistical summaries overlaid on top, to the axis labels, are layers in the plot. The consequence of this is that your use of ggplot2 will probably involve iterative addition of layer upon layer until you’re pleased with the results.

Aesthetics

The graphical properties which encode the data you’re presenting are the aesthetics of the plot. These include things like

  • x position
  • y position
  • size of elements
  • shape of elements
  • color of elements

Geometries

The primary visual items on the plots are called geometries and include things like

  • points
  • lines
  • line segments
  • bars
  • text

Some of these geometries have their own specific aesthetic settings. For example,

  • points
    • point shape
  • text
    • text labels
  • lines
    • line weight
    • line type
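As a rough sketch of how these geometry-specific settings look in code, here is a small plot built on an invented three-row data frame (toy and its columns are made up for illustration, not part of the workshop data). It only assumes that ggplot2 is installed.

```r
library(ggplot2)

# Invented toy data, purely for illustration
toy <- data.frame(
  x = c(1, 2, 3),
  y = c(2, 4, 3),
  name = c("a", "b", "c")
)

p_geoms <- ggplot(toy, aes(x = x, y = y)) +
  geom_line(linetype = "dashed") +          # line type only applies to lines
  geom_point(shape = 17, size = 3) +        # shape 17 is a filled triangle
  geom_text(aes(label = name), vjust = -1)  # text needs a label aesthetic
p_geoms
```

Each geom picks up the shared x and y positions, and then takes the settings that only make sense for that kind of geometry.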

Statistics

You’ll also frequently want to plot statistics overlaid on top of, or instead of the raw data. Some of these include

  • Smoothing and regression lines
  • One and two dimensional binning
  • Mean and medians with confidence intervals.

The aesthetics, geometries and statistics constitute the most important layers of a plot, but for fine tuning a plot for publication, there are a number of other things you’ll want to adjust. The most common of these are the scales, which encompass things like

  • A logarithmic x or y axis
  • Customized color scales
  • Customized point shapes, or linetypes

We’ll review many of these components as we build up the plot, and will circle back to more of them for greater detail.

Building the Plot

First, let’s refresh our memories of the graph we want to build.

This plot is composed of eight layers, which can be subdivided into five layer types. It’s not important for you to memorize these layer types, but it helps to structure the discussion.

Layers

The data layer

Every ggplot2 plot has a data layer, which defines the data set to plot, and the basic mappings of data to aesthetic elements. The data layer is created with the functions ggplot() and aes(), and looks like this

ggplot(data, aes(...))

The first argument to ggplot() is a data frame (it must be a data frame), and its second argument is aes(). You’re never going to use aes() in any other context except for inside of other ggplot2 functions, so it might be best not to think of aes() as its own function, but rather as a special way of defining data-to-aesthetic mappings.

Also as a reminder, we’ll be working with a dataframe that looks like this:

head(ay_means)

We’ll start by mapping the dob to the x-axis, and F1_n to the y-axis.

p <- ggplot(ay_means, aes(x = dob, y = F1_n))
p

You can think of this plot as the base image, before we’ve added any extra layers, text or Instagram filters to it. An important conceptual issue is that you are able to assign plots to variables (in this case, p). When you do this assignment, nothing special happens. But if you print out p, R will generate the plot.
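Here is a minimal sketch of that assign-then-print behavior. It uses an invented three-row data frame (toy) rather than ay_means, so it runs without the course package, and the variable names are just placeholders.

```r
library(ggplot2)

# Invented toy data, purely for illustration
toy <- data.frame(dob = c(1900, 1950, 2000),
                  F1_n = c(1.2, 0.9, 0.5))

p_toy <- ggplot(toy, aes(x = dob, y = F1_n))  # assignment: nothing is drawn
inherits(p_toy, "ggplot")                     # p_toy is just a plot object

p_toy <- p_toy + geom_point()                 # adding a layer: still nothing drawn
print(p_toy)                                  # printing is what renders the plot
```

Typing p_toy on its own line at the console or in a notebook chunk prints it automatically. Inside a function or a loop there is no automatic printing, so you would need the explicit print().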

The geometries layer

The next step, after defining the basic data-to-aesthetic mappings, is to add geometries to the data. We’ll discuss geometries in more detail below, but for now, we’ll add one of the simplest: points.

  p <- p + geom_point()
  p

There are a few things to take away from this step. First and foremost, the way you add new layers, of any kind, to a plot is with the + operator. And, as we’ll see in a moment, there’s no need to only add them one at a time. You can string together any number of layers to add to a plot, separated by +.

The next thing to notice is that all layers you add to a plot are, technically, functions. We didn’t pass any arguments to geom_point(), so the resulting plot represents the default behavior: solid black circular points.

If for no good reason at all we wanted to use a different point shape in the plot, we could specify it inside of geom_point().

ggplot(ay_means, aes(x=dob, y=F1_n)) +
  geom_point(shape = 3)

Or, if we wanted to use larger, red points, we could specify that in geom_point() as well.

ggplot(ay_means, aes(x=dob, y=F1_n)) +
  geom_point(color = "red", size = 3)

We still need to be sure to map the allophones to the color of the points, though. We’ll do this in the data layer.

p <- ggplot(ay_means, aes(x=dob, y=F1_n, color = plt_vclass)) +
    geom_point()
p

We can see a few of the default settings of ggplot2 on display here. Most striking is the light grey background, with white grid lines. Opinion varies on whether or not this is aesthetically or technically pleasing, but don’t worry, it’s adjustable.

Another default is to label the x and y axes with the column names from the data frame. I’ll inject a bit of best practice advice here, and tell you to always change the axis names. It’s nearly guaranteed that your data frame column names will make for very poor axis labels. We’ll cover how to do that shortly.

Finally, note that we didn’t need to tell geom_point() about the x and y axes. This may seem trivial, but it’s a really important and powerful aspect of ggplot2. When you add any layer at all to a plot, it will inherit the data-to-aesthetic mappings which were defined in the data layer. We’ll discuss inheritance, and how to override or define new data-to-aesthetic mappings within any geom, further on.
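To make the inheritance point concrete, here is a sketch comparing a mapping defined in the data layer with one defined inside a single geom. The data frame is invented for illustration (it only borrows the column names from ay_means), and layer_data() is used to peek at what each layer actually received.

```r
library(ggplot2)

# Invented toy data that borrows the column names of ay_means
toy <- data.frame(
  dob = c(1900, 1950, 2000, 1900, 1950, 2000),
  F1_n = c(1.2, 1.1, 1.0, 1.2, 0.9, 0.5),
  plt_vclass = rep(c("ay", "ay0"), each = 3)
)

# color is mapped in the data layer, so both geoms inherit it
p_inherit <- ggplot(toy, aes(dob, F1_n, color = plt_vclass)) +
  geom_point() +
  geom_line()

# color is mapped only inside geom_point(), so the lines never see it
p_local <- ggplot(toy, aes(dob, F1_n)) +
  geom_point(aes(color = plt_vclass)) +
  geom_line(aes(group = plt_vclass))

# Number of distinct line colors in each version
length(unique(layer_data(p_inherit, 2)$colour))
length(unique(layer_data(p_local, 2)$colour))
```

In the first plot the lines come out in two colors. In the second they share a single default color, because the color mapping lives only in the points layer.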

The statistics layer

The final figure also includes a smoothing line, which is one of many possible statistical layers we can add to a plot.

  p <- p + stat_smooth()
  p

We’ll go over the default behavior of stat_smooth() below, but in this plot, the smoothing line represents a loess smooth, and the semi-transparent ribbon surrounding the solid line is the 95% confidence interval.

One important thing to realize is that it’s not necessary to include the points in order to add a smoothing line. Here’s what the plot would look like with the points omitted.

 ggplot(ay_means, aes(x=dob, y=F1_n, color = plt_vclass)) +
  stat_smooth()

Notice how the y-axis has zoomed in to just include the range of the smoothing line and its confidence interval.

Scale transformations

I also wanted to make some alterations to the default y axis scales. The y-axis is currently running in reverse to the intuitive direction of F1. Higher vowels have lower F1 values, so we want to flip the y-axis. I also want to change the color scale, and its labels.

p <- p + scale_y_reverse()+
        scale_color_brewer(palette = "Dark2",
                           labels = c("voiced","voiceless"))
p

It’s worth noting that the smoothing line here is calculated over the transformed data.

The other kind of scale transformation you’re most likely to make would be to use a log scale on data like durations:

ggplot(ay, aes(dur))+
    stat_bin()

ggplot(ay, aes(dur))+
    stat_bin()+
    scale_x_log10()
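The note above about statistics being calculated over the transformed data applies to any scale transformation. As a sketch, using an invented data frame of three durations (toy_dur is made up for illustration), layer_data() shows the values a layer works with once scale_x_log10() has been applied.

```r
library(ggplot2)

# Invented toy durations, purely for illustration
toy_dur <- data.frame(dur = c(10, 100, 1000),
                      y = c(1, 2, 3))

p_log <- ggplot(toy_dur, aes(dur, y)) +
  geom_point() +
  scale_x_log10()

# The x values the layer sees are on the log10 scale,
# not the original 10 / 100 / 1000
layer_data(p_log)$x
```

Because the transformation happens before the layer does its work, a stat like stat_bin() or stat_smooth() added to this plot would bin or smooth in log units.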

Cosmetic alterations

Finally, I wanted to make some cosmetic adjustments to the plot. For example, the axis labels all need to be renamed. I also added a title to the plot, and changed the color theme to black and white.

p <- p + labs(x = "Date of Birth",
              y = "Normalized F1",
              color = "ay/_")+
         theme_bw()+
        ggtitle("Change in /ay/ allophones")
p

Here’s how a similar version of this plot looks in print

Further reading and exploration

For further reading on how to use ggplot2, specifically, I’d highly recommend Kieran Healy’s new book Data Visualization: A Practical Introduction.

I’d also suggest checking out:

For more general data visualization reading, you really need to at least have an opinion about Edward Tufte’s Visual Display of Quantitative Information.

---
title: "Data Visualization Workshop"
output: 
  html_notebook: 
    code_folding: none
    css: custom.css
    theme: flatly
    toc: yes
    toc_float: yes
    toc_depth: 3
date: "3 April 2019"
author: "[Josef Fruehwald](https://jofrhwld.github.io)"
---




# Setting Up For Today

Welcome to the Data Visualization Workshop! Data Viz plays a really crucial role in the more general data analysis workflow:



<div class = "half-img">
![](figures/workflow.svg)
</div>

Adequately teaching how to do data summarization would be an entirely different workshop. You'll still be able to work along here, but if you're unfamiliar with R and the tidyverse, you might have to resort to just copy-pasting code as we go along.

But, for materials for learning more about the basics of R, I'd refer you to my 2017 LSA course materials:

1. [Intro to R](https://jofrhwld.github.io/teaching/courses/2017_lsa/lectures/Session_1.nb.html)
2. [Data and DataFrames](https://jofrhwld.github.io/teaching/courses/2017_lsa/lectures/Session_2.nb.html)
3. [Split-Apply-Combine](https://jofrhwld.github.io/teaching/courses/2017_lsa/lectures/Session_3.nb.html)

--------------

## The process of learning R 

These are some of the core areas I figure are necessary to getting good at statistical modelling in R:

1. Using R (and RStudio) well
2. Feeling comfortable and fluid reorganizing and summarizing data
3. **Visualizing Data**
4. Deciding before you model what you want to compare to what
5. How to translate your analysis goals into R code
6. Understanding a little bit about statistics
7. When something goes wrong, being able to accurately attribute your difficulty to one of the above topics

These are all skills you can achieve through practice, experience, and occasional guidance from someone more skilled than you. It is exactly like acquiring any other skill or craft. At first it will be confusing, you'll make some mistakes, and it won't look so good. I like comparing it to knitting.

<div style="width:100%;float:left;">
<div style = "width:35%;float:left;margin-left:10%;margin-right:5%;margin-bottom:5%;">

The first hat I ever knit:

![](figures/firsthat.jpg)

</div>

<div style = "width:35%;float:left;margins:auto;margin-right:10%;margin-left:5%;margin-bottom:5%;">

A more recent hat I knit: 

![](figures/lasthat.jpg)

</div>


</div>


The way I improved my knitting is exactly the same as how you can improve your R programming ability:

* I knit a lot (almost every day).
* I memorized a bunch of stuff.
* I remembered where to look up the stuff I didn't have memorized.
* My knitting became more "idiomatic" (i.e. I started knitting like how other knitters knit).
* I learned how to identify and fix mistakes without undoing my entire project.
* I developed good workspace hygiene & organization.
* As I got the basics down, I started researching and incorporating fussy little details into my work.


## R, RStudio and R Notebooks

We're going to be using R, RStudio, and R Notebooks in this course, and it's a little important to keep straight what these three things are:

### R

**R** is a programming language that runs on your computer. At its barest bones, it looks like this:

<div class = "half-img">
![](figures/2__R.png)
</div>

You can type text into the prompt there, and if you've successfully memorized the right R commands, it'll do some things.


### RStudio

**RStudio** is like an Instagram filter over the top of R, to make your experience of using R better. It visually organizes some important components of using R into panes, and offers *code completion* suggestions. For example, if you remember there's something called a "Wilcoxon test", but you don't remember what the function in R is, you can start typing in `Wilc`, and this will happen:

<div class = "half-img">
![](figures/codeCompletion.png)
</div>

RStudio's autocompletion is really useful for a lot of other things, like reminding you what the column names are in your data frame, what the names of all the arguments to a function are, etc. 

But perhaps the most valuable component in RStudio these days is its authoring tools, like R Notebooks.

### R Notebooks

R Notebooks allow you to document your code in plain text, insert R Code chunks, and view the results of the R code all in one place, then compile it into a nice looking notebook.


### Discussion

I'm going to recommend (for now at least) that you run all of your code through an R Notebook. It is possible to just type things into the R console, but that's kind of like dictating a paper into thin air. Once you've spoken the words, they disappear and can be hard to recover.

My earlier advice would have been to write all of your code in an R script file, but that also separates the code from its results, which can be hard for beginners to keep track of. 

<hr ></hr>


## Installing R Packages

R comes with a lot of functionality installed, but one way that R is extensible is through users' ability to contribute new code & data through its package management system. We're going to be using a number of these packages in the course, especially since a few of them have fundamentally changed the way R programming works in the past 3 years. There's also a course R package I've created to easily distribute sample datasets.

Here's a basic diagram of how R packages work:

![](figures/cran_package.png)
 


### Installing Packages

#### `install.packages()`

Most R packages are distributed through CRAN (Comprehensive R Archive Network). When you run the function `install.packages("x")`, R checks whether the package `"x"` exists on CRAN, and installs it on your computer if it does. You may be asked to choose a "CRAN mirror" the first time you run `install.packages()`. This is because there are many copies of CRAN distributed across the internet. I'd recommend choosing the first option, called `0-Cloud`.


#### `install_github()`

As a package developer, getting a package onto CRAN can be a bit of a pain, so some packages (and development versions of many) are also available on GitHub, which can be easily installed with `devtools::install_github("username/package")`.


### Installing packages is different from loading packages

**Installing** a package is different from **loading** packages. Installing a package only downloads and configures the code on your computer. In order to *use* the contents of a package, you need to load it into your R session with `library()`.

- You only need to run `install.packages()` once to install a package, or to update a package.
- You need to run `library()` at the start of every new R session in order to use the functionality from that package.

For example, `ggplot()` is a function from the package `ggplot2`. I have already installed `ggplot2` on my computer, but if I try to use `ggplot()` before loading the package with `library()`, I'll get the error that the function was not found.

```{r}
foo <- ggplot()
```

```{r}
library("ggplot2")
foo <- ggplot()
```
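Since installing only has to happen once but loading has to happen every session, it can be handy to check which packages are missing before a script starts. The following is only a sketch: `missing_packages()` is a hypothetical helper I'm defining here, not a function from any package, and it relies on base R's `requireNamespace()`.

```r
# Hypothetical helper (not part of any package): returns the names in `pkgs`
# that are not installed. requireNamespace() loads a package's namespace if it
# can, without attaching it the way library() does.
missing_packages <- function(pkgs) {
  is_missing <- vapply(
    pkgs,
    function(pkg) !requireNamespace(pkg, quietly = TRUE),
    logical(1)
  )
  pkgs[is_missing]
}

# "stats" and "utils" ship with R, so nothing should be reported as missing
missing_packages(c("stats", "utils"))
```

Anything this returns could be passed to `install.packages()` once. You would still call `library()` for each package at the start of every session.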


<div class = "box break">
<span class="big-label">~2 Minute Activity</span>
Let's actually load up the initial packages we're going to use today:

```{r}
library(tidyverse)
library(lsa2017)
library(broom)
```

</div>

-------------------


# First Principles of Plotting

## Why Plot

Data visualization is an *essential* component of data analysis. In fact, I believe it is necessary to learn how to plot your data before you learn how to do statistical modelling. If you haven't made a *lot* of graphs of your data, and have only looked at averages, correlations, and linear model results, then you don't really understand your data.

There's a classic illustration of this called Anscombe's quartet, which when plotted looks like four very distinct patterns.

```{r echo = F}
tidy_anscombe <- anscombe %>%
                    mutate(idx = 1:n()) %>%
                    gather(key, value, x1:y4) %>%
                    separate(key, into = c("variable", "series"), sep = 1) %>%
                    spread(variable, value)

tidy_anscombe %>%
  ggplot(aes(x, y)) + 
    geom_point(size = 3) + 
    facet_wrap(~series)
```

But if you fit linear models to them, they have nearly identical statistical properties.

```{r}
fit_lm <- function(df){
  lm(y ~ x, data = df)
}

anscomb_models <- tidy_anscombe %>%
                    group_by(series) %>%
                    nest() %>%
                    mutate(model = map(data, fit_lm),
                           model_param_df = map(model, tidy),
                           model_glance = map(model, glance))

anscomb_models %>%
  unnest(model_param_df) %>%
  arrange(term)
```
```{r}
anscomb_models %>%
  unnest(model_glance)
```
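If you want to double check this without the tidyverse or `broom`, the same comparison can be sketched in base R alone, since the `anscombe` data frame ships with R. The object names `fits` and `coefs` are just placeholders for this example.

```r
# anscombe is a built-in data frame with columns x1..x4 and y1..y4
fits <- lapply(1:4, function(i) {
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
})

# One row per series, one column per coefficient
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
round(coefs, 2)
```

All four rows should come out with an intercept of roughly 3 and a slope of roughly 0.5, even though the four scatterplots look nothing alike.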

It has been even more humorously illustrated recently that you can produce data sets of almost any arbitrary shape that have nearly identical statistical properties.

![](figures/DinoSequentialSmaller.gif)
[Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing](https://www.autodeskresearch.com/publications/samestats)

In fact, more and more people are starting to use statistical techniques where the *only* plausible way to report the results of the model is with a graphical display.

```{r echo = F}
library(mgcv)
library(itsadug)
```

```{r echo = F}
ay_temp <- ay %>%
            filter(fol_seg %in% c("T", "D"),
                   context == "internal") %>%
            mutate(dob = year-age,
                   plt_vclass = factor(plt_vclass),
                   word = factor(word),
                   idstring = factor(idstring),
                   log2dur = log2(dur),
                   dur_c = log2dur - median(log2dur))
ay_bam <- bam(F1_n ~ plt_vclass + s(dob, by = plt_vclass)+
                s(log2dur, by = plt_vclass)+
                te(dob, log2dur, by = plt_vclass)+
                s(idstring, bs = 're') + s(word, bs = 're'),
              data = ay_temp)
```

```{r echo =  F}
plot_diff(ay_bam,view = "dob", comp = list(plt_vclass = c("ay", "ay0")), rm.ranef = T, print.summary = F)
```


## Thinking about Plotting

It's important to think of your figures as a *report* of your data. Try to take as much care in producing your plots as you do your writing, or reporting of your statistics. They are as important as (or for some readers, more important than) anything else in your paper.


## "Accuracy"

When making a plot, you should strive for accuracy in:

- Accurately representing the properties of numbers.
- Accurately representing the nature of your data.

Take this very simple data set:

<div style = "width:50%">

| group | value |
| ----: | ----: |
| A | 2 |
| B | 5 |

</div>

For our purposes, these numbers have three properties.

1. **Order**: 2 < 5, or A < B
2. **Magnitude**: 5 = 2.5 $\times$ 2, or B = 2.5 $\times$ A
3. **Contextual Magnitude**: If A and B are bars, and these are measures of the cost of a pint, then A must be a real dive (and a good deal), and B must be a little bit better, but still not too fancy. If A and B are people, and these are their number of legs, then A has an unsurprising number of legs and B has a surprising number of legs.

Here is an example of an inaccurate plot:

```{r fig.width = 5/2, fig.height = 5/2, echo = F}
num <- data.frame(group = c("A", "B"),
                  value = c(2, 5))

ggplot(num, aes(group, value))+
    geom_segment(aes(xend = group, y=1.5, yend=value), size=10)+
    theme_minimal()
```

It successfully captures the *order* of A and B, but fails to capture the correct *magnitude* of the difference. The magnitude is thrown off because the y-axis doesn't start at 0. In this plot, the B line is `r (5-1.5)/(2-1.5)`$\times$ as long as the A line, but B is actually only 2.5$\times$ as large as A. This produces a "lie factor" of $\frac{7}{2.5} = `r 7/2.5`$.
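
The arithmetic behind that lie factor can be sketched in a few lines of R. The variable names here are only for illustration, and the calculation follows the simplified ratio-of-ratios definition used above.

```{r}
# Values from the toy data set, and the y-value where the plotted lines start
values <- c(A = 2, B = 5)
axis_start <- 1.5

# Length of each line as drawn, measured from where the axis starts
drawn_length <- values - axis_start

shown_ratio <- drawn_length[["B"]] / drawn_length[["A"]]  # ratio the reader sees
true_ratio  <- values[["B"]] / values[["A"]]              # ratio in the data

lie_factor <- shown_ratio / true_ratio
c(shown = shown_ratio, true = true_ratio, lie_factor = lie_factor)
```

If the lines started at 0 instead, `drawn_length` would equal `values`, and the lie factor would be 1.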

This isn't just a hypothetical problem either. For example, British electoral mailers are notorious for inaccurately portraying the magnitude of differences.

![](figures/inaccurate.png)


```{r echo = F}
scot <- tibble(party = c("Conservative",
                         "SNP",
                         "Lib Dem",
                         "Labour"),
               mps = c(1, 7, 12, 39))
ggplot(scot, aes(party, mps, fill = party))+
  geom_bar(stat = "identity", color = "black")+
  xlim(c("Conservative",
         "SNP",
         "Lib Dem",
         "Labour"))+
  scale_fill_manual(limits= c("Conservative",
                              "SNP",
                              "Lib Dem",
                              "Labour"),
                    values = c("#0087dc",
                               "#FFF95D",
                               "#FDBB30",
                               "#d50000"))+
    theme_minimal()+
      theme(legend.position = "none")

```

Both academic researchers and the producers of these political mailers may counter by saying

> But the axes were labelled accurately!

If readers would understand your data better if they *ignored the graphical elements of your plot*, then your plot is net-negative to accurate communication, and I can only assume that accuracy was never your primary goal anyway. 

## Think about how plots are read

The way readers decode numbers from a plot also matters a lot when you're constructing a figure.
For example, the bar plot above maps the number of MPs to the *length* of the bars. Another option is to map the number of MPs to the *position* of points.

```{r echo = F}
ggplot(scot, aes(party, mps, color = party))+
  geom_point(size = 6, color = "black")+
  geom_point(size = 5)+
  xlim(rev(c("Conservative",
         "SNP",
         "Lib Dem",
         "Labour")))+
  scale_color_manual(limits= c("Conservative",
                              "SNP",
                              "Lib Dem",
                              "Labour"),
                    values = c("#0087dc",
                               "#FFF95D",
                               "#FDBB30",
                               "#d50000"))+
    theme_minimal()+
    coord_flip()+
    theme(legend.position = "none")
```

Or, we could map it to the *area* of circles:

```{r echo = F}
scot <- tibble(party = c("Conservative",
                         "SNP",
                         "Lib Dem",
                         "Labour"),
               mps = c(1, 7, 12, 39),
               x0 = c(0,5,10,15),
               y0 = rep(1, 4))

ggplot(scot)+
    ggforce::geom_circle(aes(x0 = x0, y0 = y0, r = sqrt(mps), fill = party), alpha = 0.6)+
    coord_fixed()+
   scale_fill_manual(limits= c("Conservative",
                              "SNP",
                              "Lib Dem",
                              "Labour"),
                    values = c("#0087dc",
                               "#FFF95D",
                               "#FDBB30",
                               "#d50000"))+
    ggtitle("Number of MPs")+
    theme_void()
```

Or, to the *angle* of a pie chart slice:

```{r}
ggplot(scot, aes("x", mps, fill = party))+
    geom_bar(stat = "identity", position = "stack", color = "black")+
    coord_polar(theta = "y")+
     scale_fill_manual(limits= c("Conservative",
                              "SNP",
                              "Lib Dem",
                              "Labour"),
                    values = c("#0087dc",
                               "#FFF95D",
                               "#FDBB30",
                               "#d50000"))+
    theme_void()+
    ggtitle("Proportion of MPs")
```

However, not all visual dimensions are as easily decoded as others. Research on graphical perception has found that people are more accurate in their perception of numeric differences when graphs use **length** and **position**, rather than area or angle.
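
If you do end up mapping a value to the size of a circle, the radius needs to scale with the square root of the value, so that the *area* is proportional to it. Here is a quick sketch of the difference, using the MP counts from above (the variable names are only for illustration):

```{r}
mp_counts <- c(Conservative = 1, SNP = 7, `Lib Dem` = 12, Labour = 39)

# Radius proportional to sqrt(value): the areas keep the original ratios
area_sqrt_radius <- pi * sqrt(mp_counts)^2
area_sqrt_radius / area_sqrt_radius[["Conservative"]]

# Radius proportional to the value itself: the areas grow with the square
area_linear_radius <- pi * mp_counts^2
area_linear_radius / area_linear_radius[["Conservative"]]
```

With the second mapping, Labour's circle would have 1,521 times the area of the Conservatives' circle, rather than 39 times.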

## Colo(u)r

We also need to think a *lot* about how we use color in figures. In the political graphs, there are clear and iconic colors associated with each political party, which I would recommend using in a case like that. Choosing a different color palette makes the graph more confusing.

```{r echo = F}
ggplot(scot, aes(party, mps, fill = party))+
  geom_bar(stat = "identity", color = "black")+
  xlim(c("Conservative",
         "SNP",
         "Lib Dem",
         "Labour"))+
  scale_fill_viridis_d(limits= c("Conservative",
                              "SNP",
                              "Lib Dem",
                              "Labour"))+
    theme_minimal()+
      theme(legend.position = "none")
```


There are many other kinds of conventionalized color-meaning mappings that you should usually stick to (like cold = blue, hot = red). But you should also be careful to avoid using conventionalized color schemes that either reinforce harmful stereotypes (like pinking and bluing gender) or could be otherwise offensive to cultural sensitivities.


You should also be careful to avoid accidentally conveying something you don't intend to with your color choices:

```{r echo = F}
ay %>%
  filter(word != "I") %>%
  mutate(voicing = case_when(plt_vclass == "ay0" ~ "voiceless",
                             context %in% c("final", "coextensive")~"word final",
                             fol_seg %in% c("M", "N", "NG")~"nasal",
                             fol_seg %in% c("L", "R", "W", "Y") ~ "liquid",
                             grepl("[AEIOU]", fol_seg)~"hiatus",
                             TRUE ~ "voiced"),
         voicing = factor(voicing, levels = c("hiatus",
                                              "liquid",
                                              "nasal",
                                              "voiceless",
                                              "word final",
                                              "voiced"))) %>%
  ggplot(aes(log2(dur), color = voicing))+
    geom_density(color = "black", aes(group = voicing), size = 2)+
    geom_density(size = 1)+
    scale_color_manual(values = c("#EFF3FF", "#BDD7E7", "#6BAED6", "#3182BD", "#08519C", "red"))+
    theme_minimal()+
    ggtitle("/ay/ duration distributions")
    
```

There's no sensible order to the voicing contexts that /ay/ appears in above, but one seems to be implied through the use of a gradient color scheme. The voiced context is also specially highlighted by being mapped to a red hue while the rest are mapped to blue hues. Rather, we'd probably want a color palette like one of the two below, where the colors are distinct, unordered, and perceptually uniform.

```{r echo = F}
ay %>%
  filter(word != "I") %>%
  mutate(voicing = case_when(plt_vclass == "ay0" ~ "voiceless",
                             context %in% c("final", "coextensive")~"word final",
                             fol_seg %in% c("M", "N", "NG")~"nasal",
                             fol_seg %in% c("L", "R", "W", "Y") ~ "liquid",
                             grepl("[AEIOU]", fol_seg)~"hiatus",
                             TRUE ~ "voiced"),
         voicing = factor(voicing, levels = c("hiatus",
                                              "liquid",
                                              "nasal",
                                              "voiceless",
                                              "word final",
                                              "voiced"))) %>%
  ggplot(aes(log2(dur), color = voicing))+
    geom_density(color = "black", aes(group = voicing), size = 2)+
    geom_density(size = 1)+
    scale_color_brewer(palette = "Dark2")+
    theme_minimal()+
    ggtitle("/ay/ duration distributions")
    
```


```{r echo = F}
ay %>%
  filter(word != "I") %>%
  mutate(voicing = case_when(plt_vclass == "ay0" ~ "voiceless",
                             context %in% c("final", "coextensive")~"word final",
                             fol_seg %in% c("M", "N", "NG")~"nasal",
                             fol_seg %in% c("L", "R", "W", "Y") ~ "liquid",
                             grepl("[AEIOU]", fol_seg)~"hiatus",
                             TRUE ~ "voiced"),
         voicing = factor(voicing, levels = c("hiatus",
                                              "liquid",
                                              "nasal",
                                              "voiceless",
                                              "word final",
                                              "voiced"))) %>%
  ggplot(aes(log2(dur), color = voicing))+
    geom_density(color = "black", aes(group = voicing), size = 2)+
    geom_density(size = 1)+
    ggthemes::scale_color_colorblind()+
    theme_minimal()+
    ggtitle("/ay/ duration distributions")
    
```
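
Here is a minimal sketch of how you might apply an unordered (qualitative) palette yourself. It uses simulated data with made-up group labels rather than the /ay/ data, so the only thing to take from it is the `scale_color_brewer()` call.

```{r}
library(ggplot2)

# Simulated data: three groups with no inherent order (hypothetical labels)
set.seed(1)
palette_demo <- data.frame(
  group = rep(c("group 1", "group 2", "group 3"), each = 100),
  value = rnorm(300, mean = rep(c(0, 1, 2), each = 100))
)

palette_plot <- ggplot(palette_demo, aes(value, color = group)) +
  geom_density() +
  # "Dark2" is one of ColorBrewer's qualitative palettes
  scale_color_brewer(palette = "Dark2") +
  theme_minimal()
palette_plot
```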




# `ggplot2` basic concepts

## Why learn `ggplot2`?

There are a few different graphics packages out there to use, including base R plots and `lattice`. `ggplot2`, however, is much more plugged into the tidyverse workflow, which seems to be the direction R programming is trending in. It's also extensible, meaning people are producing [a lot of really cool and really useful add-ons](http://www.ggplot2-exts.org)!





## Layers, Aesthetics, Geometries and Statistics

We're going to start by working with the /ay/ dataset from the `lsa2017` package.

<div class = "box break">
<span class="big-label">The /ay/ dataset</span>

This /ay/ data set contains over 80,000 tokens of the /ay/ vowel which were automatically extracted from 326 sociolinguistic interviews in Philadelphia. There are two allophones of /ay/ encoded in the data column `plt_vclass`:

- `ay0` = pre-voiceless /ay/
- `ay` = all other /ay/

Across the 20th century, the vowel quality of pre-voiceless /ay/ underwent a large change from something like [ɑɪ] to [ʌɪ]. This was a phonetically gradual change, so we'll be plotting it according to the primary phonetic correlate of this change, normalized F1.
</div>


To kick things off, we'll just estimate the average normalized F1 of each allophone for each speaker:

```{r}
ay_means <- ay %>%
            mutate(dob = year-age) %>%
            group_by(idstring, dob, plt_vclass) %>%
            summarise(F1_n = mean(F1_n))
```

```{r}
head(ay_means)
```

We're going to take this data and build up to making this plot:

```{r echo = F, message = F, fig.width=8/1.25, fig.height = 5/1.25}
 ggplot(ay_means, aes(dob, F1_n, color = plt_vclass))+
    geom_point(size = 3, alpha = 0.6)+
    stat_smooth(size = 1)+
    scale_y_reverse()+
    scale_color_brewer(palette = "Dark2",
                       labels = c("voiced",
                                  "voiceless"))+
    labs(x = "Date of Birth",
         y = "Normalized F1",
         color = "ay/__")+
    ggtitle("Change in /ay/ allophones")+
    theme_bw()
```

### Layers

Hopefully, you'll start to look at figures like this one the way many of us look at the image below.

<div class = 'half-img'>
![](figures/Enlight9.jpg)
</div>

Those of us familiar with this kind of media know that the picture of the library is not what was originally captured by my phone. Rather, there are multiple layers of effects, filters and text on top of the base image, which produce the final image. And in fact, some of these layers are crucially ordered. For example, the text would look different if it were added to the image first, and then the filters, instead of vice versa.


So too with the `ggplot2` plot above. These plots are constructed out of __layers__. Every component of the graph, from the underlying data it's plotting, to the coordinate system it's plotted on, to the statistical summaries overlaid on top, to the axis labels, is a layer in the plot. The consequence of this is that your use of `ggplot2` will probably involve iterative addition of layer upon layer until you're pleased with the results.


### Aesthetics

The graphical properties which encode the data you're presenting are the __aesthetics__ of the plot. These include things like

- x position
- y position
- size of elements
- shape of elements
- color of elements


### Geometries

The primary visual items on the plots are called __geometries__ and include things like

* points
* lines
* line segments
* bars
* text

Some of these geometries have their own specific aesthetic settings. For example,

* points
    * point shape
* text
    * text labels
* lines
    * line weight
    * line type
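
As a rough sketch of how these geometry-specific settings are spelled in code, here is a tiny made-up data frame drawn with a line, points, and text labels. The data and the particular values chosen for `linetype`, `shape`, and so on are arbitrary.

```{r}
library(ggplot2)

# A made-up data frame, just for illustration
geom_demo <- data.frame(x = 1:4,
                        y = c(2, 3, 5, 4),
                        name = c("a", "b", "c", "d"))

geom_demo_plot <- ggplot(geom_demo, aes(x, y)) +
  geom_line(linetype = "dashed") +             # line type
  geom_point(shape = 17, size = 3) +           # point shape (17 is a filled triangle)
  geom_text(aes(label = name), nudge_y = 0.3)  # text needs a label aesthetic
geom_demo_plot
```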
  
### Statistics

You'll also frequently want to plot __statistics__ overlaid on top of, or instead of, the raw data. Some of these include

* Smoothing and regression lines
* One and two dimensional binning
* Means and medians with confidence intervals
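
As one small sketch of a statistic drawn over raw data, the code below uses the `mpg` data set that ships with `ggplot2` as a stand-in for your own data. Note that `mean_se` draws the mean plus or minus one standard error, which is narrower than a 95% confidence interval.

```{r}
library(ggplot2)

# `mpg` ships with ggplot2; it stands in here for your own data
stat_demo_plot <- ggplot(mpg, aes(class, hwy)) +
  # raw data in the background
  geom_point(alpha = 0.2) +
  # mean plus/minus one standard error for each class
  stat_summary(fun.data = mean_se, color = "red")
stat_demo_plot
```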


----

The __aesthetics__, __geometries__ and __statistics__ constitute the most important __layers__ of a plot, but for fine tuning a plot for publication, there are a number of other things you'll want to adjust. The most common of these are the __scales__, which encompass things like

* A logarithmic x or y axis
* Customized color scales
* Customized point shapes, or linetypes

We'll review many of these components as we build up the plot, and will circle back to more of them for greater detail.



# Building the Plot

First, let's refresh our memories of the graph we want to build.

```{r echo = F, message = F, fig.width=8/1.25, fig.height = 5/1.25}
 ggplot(ay_means, aes(dob, F1_n, color = plt_vclass))+
    geom_point()+
    stat_smooth()+
    scale_y_reverse()+
    scale_color_brewer(palette = "Dark2",
                       labels = c("voiced",
                                  "voiceless"))+
    labs(x = "Date of Birth",
         y = "Normalized F1",
         color = "ay/__")+
    ggtitle("Change in /ay/ allophones")+
    theme_bw()
```

This plot is composed of eight layers, which can be subdivided into five layer types. It's not important for you to memorize these layer types, but it helps to structure the discussion.

## Layers

### The data layer

Every `ggplot2` plot has a data layer, which defines the data set to plot, and the basic mappings of data to aesthetic elements. The data layer is created with the functions `ggplot()` and `aes()`, and looks like this:

```{r eval=F}
ggplot(data, aes(...))
```

The first argument to `ggplot()` is a data frame (it _must_ be a data frame), and its second argument is `aes()`. You're never going to use `aes()` in any other context except for inside of other `ggplot2` functions, so it might be best not to think of `aes()` as its own function, but rather as a special way of defining data-to-aesthetic mappings.
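
You wouldn't normally do this, but to see that `aes()` only records which columns go with which aesthetics, you can call it on its own. This is a small sketch: the column names are the ones we're about to use, but nothing is looked up in a data frame at this point.

```{r}
library(ggplot2)

# aes() doesn't touch any data; it just stores the mapping
ay_mapping <- aes(x = dob, y = F1_n)
names(ay_mapping)
```

The column names are only resolved later, when the plot is drawn with a data frame that contains them.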


Also as a reminder, we'll be working with a dataframe that looks like this:

```{r}
head(ay_means)
```

We'll start by mapping the `dob` to the x-axis, and `F1_n` to the y-axis.

```{r}
p <- ggplot(ay_means, aes(x = dob, y = F1_n))
p
```

You can think of this plot as the base image, before we've added any extra layers, text or Instagram filters to it. An important conceptual issue is that you are able to assign plots to variables (in this case, `p`). When you do this assignment, nothing is drawn. But if you print out `p`, R will generate the plot.


### The geometries layer

The next step, after defining the basic data-to-aesthetic mappings, is to add geometries to the data. We'll discuss geometries in more detail below, but for now, we'll add one of the simplest: points.

```{r fig.pos = "center",fig.width = 8/1.5, fig.height=5/1.5}
  p <- p + geom_point()
  p
```

There are a few things to take away from this step. First and foremost, the way you add new layers, of any kind, to a plot is with the `+` operator. And, as we'll see in a moment, there's no need to only add them one at a time. You can string together any number of layers to add to a plot, separated by `+`.
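
To make the two styles concrete, here is a sketch that builds the same plot one layer at a time and then all at once. It uses the `mpg` data set that ships with `ggplot2` as a stand-in for the /ay/ data, and the object names are only for illustration.

```{r}
library(ggplot2)

# One layer at a time
step_plot <- ggplot(mpg, aes(displ, hwy))
step_plot <- step_plot + geom_point()
step_plot <- step_plot + stat_smooth(method = "lm", formula = y ~ x)

# All at once
chain_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

# Both plots end up with the same two layers
c(step = length(step_plot$layers), chain = length(chain_plot$layers))
```

Note that the `+` has to come at the end of a line, not at the start of the next one. Otherwise R treats the first line as a complete expression.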

The next thing to notice is that all layers you add to a plot are, technically, functions. We didn't pass any arguments to `geom_point()`, so the resulting plot represents the default behavior: solid black circular points.

If for no good reason at all we wanted to use a different point shape in the plot, we could specify it inside of `geom_point()`.

```{r fig.pos = "center",fig.width = 8/1.5, fig.height=5/1.5, tidy = F}
ggplot(ay_means, aes(x=dob, y=F1_n)) +
  geom_point(shape = 3)
```

Or, if we wanted to use larger, red points, we could specify that in `geom_point()` as well.

```{r fig.pos = "center",fig.width = 8/1.5, fig.height=5/1.5, tidy = F}
ggplot(ay_means, aes(x=dob, y=F1_n)) +
  geom_point(color = "red", size = 3)
```

We still need to be sure to map the allophones to the color of the points, though. We'll do this in the data layer.

```{r fig.pos = "center",fig.width = 8/1.5, fig.height=5/1.5}
p <- ggplot(ay_means, aes(x=dob, y=F1_n, color = plt_vclass)) +
    geom_point()
p
```



We can see a few of the default settings of `ggplot2` on display here. Most striking is the light grey background, with white grid lines. Opinion varies on whether or not this is aesthetically or technically pleasing, but don't worry, it's adjustable.

Another default is to label the x and y axes with the column names from the data frame. I'll inject a bit of best practice advice here, and tell you to _always_ change the axis names. It's nearly guaranteed that your data frame column names will make for very poor axis labels. We'll cover how to do that shortly.

Finally, note that we didn't need to tell `geom_point()` about the x and y axes. This may seem trivial, but it's a really important and powerful aspect of `ggplot2`. When you add any layer at all to a plot, it will __inherit__ the data-to-aesthetic mappings which were defined in the data layer. It's also possible to override these mappings, or to define new ones, within any individual geom.
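
Here is a sketch of what inheritance looks like in practice, again using `ggplot2`'s built-in `mpg` data as a stand-in (the object names are hypothetical). In the first plot, color is mapped in the data layer, so every layer inherits it. In the second, color is mapped only inside `geom_point()`, so the smoothing line inherits x and y but not color.

```{r}
library(ggplot2)

# color mapped in the data layer: points AND smooths are split by drive type
inherit_all <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

# color mapped only in geom_point(): one smooth across all of the data
inherit_some <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv)) +
  stat_smooth(method = "lm", formula = y ~ x)

inherit_all
inherit_some
```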


### The statistics layer

The final figure also includes a smoothing line, which is one of many possible statistical layers we can add to a plot.

```{r fig.width = 8/1.5, fig.height=5/1.5}
  p <- p + stat_smooth()
  p
```

We'll go over the default behavior of `stat_smooth()` below, but in this plot, the smoothing line represents a loess smooth, and the semi-transparent ribbon surrounding the solid line is the 95% confidence interval.
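
If you'd rather not rely on the defaults, you can spell them out. The sketch below uses simulated data with made-up column names (`year` and `measure`). The `method`, `se` and `level` arguments are written out explicitly with the same values `stat_smooth()` would pick by default for a small data set like this one.

```{r}
library(ggplot2)

# Simulated data (hypothetical columns `year` and `measure`)
set.seed(1)
smooth_demo <- data.frame(year = 1900:1999)
smooth_demo$measure <- sin(smooth_demo$year / 15) + rnorm(100, sd = 0.3)

smooth_plot <- ggplot(smooth_demo, aes(year, measure)) +
  geom_point() +
  stat_smooth(method = "loess",  # the smoother to use
              formula = y ~ x,
              se = TRUE,         # draw the confidence ribbon
              level = 0.95)      # width of the confidence interval
smooth_plot
```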

One important thing to realize is that it's not necessary to include the points in order to add a smoothing line. Here's what the plot would look like with the points omitted.

```{r fig.width = 8/1.5, fig.height=5/1.5, tidy =F}
 ggplot(ay_means, aes(x=dob, y=F1_n, color = plt_vclass)) +
  stat_smooth()
```


Notice how the y-axis has zoomed in to just include the range of the smoothing line and its confidence interval.

### Scale transformations
I also wanted to make some alterations to the default scales. The y-axis is currently running in reverse to the intuitive direction of F1. _Higher_ vowels have _lower_ F1 values, so we want to flip the y-axis. I also want to change the color scale, and its labels.


```{r tidy = F,fig.width = 8/1.5, fig.height=5/1.5}
p <- p + scale_y_reverse()+
        scale_color_brewer(palette = "Dark2",
                           labels = c("voiced","voiceless"))
p
```


It's worth noting that the smoothing line here is calculated over the _transformed_ data.
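
Flipping the axis has little practical consequence for the smooth, but the same ordering applies to transformations like the log, where it does matter. Here is a small sketch with two made-up values: with `scale_y_log10()`, a mean computed by a statistics layer is the mean of the *logged* values. It assumes a version of `ggplot2` recent enough for `stat_summary()` to have the `fun` argument.

```{r}
library(ggplot2)

# Two made-up values: 1 and 100
scale_demo <- data.frame(group = "a", value = c(1, 100))

scale_demo_plot <- ggplot(scale_demo, aes(group, value)) +
  stat_summary(fun = mean, geom = "point") +
  scale_y_log10()

# The y value computed by the statistics layer, on the log10 scale
layer_data(scale_demo_plot, 1)$y
```

The result is 1 on the log10 scale, which is 10 on the original scale (the mean of log10(1) = 0 and log10(100) = 2), rather than the arithmetic mean of 50.5.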


The other kind of scale transformation you're most likely to make would be to use a log scale on data like durations:


<div style="width:100%;float:left;">
<div style = "width:35%;float:left;margin-left:10%;margin-right:5%;margin-bottom:5%;">

```{r tidy = F,fig.width = 8/3, fig.height=5/3}
ggplot(ay, aes(dur))+
    stat_bin()
```

</div>

<div style = "width:35%;float:left;margins:auto;margin-right:10%;margin-left:5%;margin-bottom:5%;">

```{r tidy = F,fig.width = 8/1.5, fig.height=5/1.5}
ggplot(ay, aes(dur))+
    stat_bin()+
    scale_x_log10()
```

</div>


</div>


### Cosmetic alterations
Finally, I wanted to make some cosmetic adjustments to the plot. For example, the axis labels all need to be renamed. I also added a title to the plot, and changed the color theme to black and white.

```{r tidy = F,fig.width = 8/1.5, fig.height=5/1.5}
p <- p + labs(x = "Date of Birth",
              y = "Normalized F1",
              color = "ay/__")+
         theme_bw()+
        ggtitle("Change in /ay/ allophones")

p
```


Here's how a similar version of this plot looks in print:


![](figures/ays_means_plot-1.png)



# Further reading and exploration

For further reading on how to use ggplot2, specifically, I'd highly recommend Kieran Healy's new book *Data Visualization: A Practical Introduction*.

<div class = "half-img">

[![](figures/dv_cover_tiny.png)](https://www.amazon.com/gp/product/0691181624/ref=as_li_tl?ie=UTF8&tag=kieranhealysw-20&camp=1789&creative=9325&linkCode=as2&creativeASIN=0691181624&linkId=16d53b3cc1ec3bc3aac60b27c29b92e8)

</div>

I'd also suggest checking out:


- The [ggthemes package](https://www.ggplot2-exts.org/ggthemes.html)
- The [ggrepel package](https://www.ggplot2-exts.org/ggrepel.html)
- The [ggforce package](https://cran.r-project.org/web/packages/ggforce/vignettes/Visual_Guide.html)
- This [post on the viridis scale](https://rtask.thinkr.fr/blog/ggplot2-welcome-viridis/)

For more general data visualization reading, you really need to at least have an opinion about Edward Tufte's *Visual Display of Quantitative Information*.

<div class = "half-img">
[![](figures/vis_display.png)](https://www.edwardtufte.com/tufte/books_vdqi)
</div>



